Investigation into the Relationship between GDP and Child Mortality Rates over Time

STAT 331 Final Project

Author

Alex Sullivan, Megan Kerner, Isabelle Phraner, Vi-Linh Vu

Reading in Data

Code
# | code-fold: true
library(tidyverse)
library(here)
library(broom)
library(purrr)
library(knitr)
library(gganimate)
child_mortality <- read_csv("data/child_mortality_0_5_year_olds_dying_per_1000_born.csv")
GDP_per_Cap <- read_csv("data/gdp_pcap.csv")

1 Data Description and Cleaning Process

In this analysis, we will be exploring the relationship between child mortality rate and GDP per capita across various countries over time. We utilize two primary datasets:

  1. Child Mortality Dataset: Combines data from multiple sources including Gapminder (v7) for years 1800 to 1950, UNIGME (a collaboration among UNICEF, WHO, the UN Population Division, and the World Bank) for years 1950 to 2016, and UN POP (World Population Prospects 2019) for projections up to 2100.

  2. GDP per Capita Dataset: Aggregates GDP per capita data from the World Bank, the Maddison Project Database, and the Penn World Table, offering a comprehensive view of economic performance across nations over time.

To ensure the relevance and accuracy of our analysis, we confined our examination to the period between the 1950s and 2020. A few factors motivated this decision. First, due to timelines of independence and colonial rule, many of the countries in the dataset did not formally exist until the 1950s or 1960s, and were instead territories of world powers such as England. Second, by refining our time frame, we are able to analyze historical trends without projected data beyond 2020 skewing our results. Having more consistent data also allows us to minimize the impact of NA values.


Code
# | code-fold: true
# Refine child_mortality dataset
child_mortality <- child_mortality |>
  select(country, `1950`:`2020`)

# Refine GDP_per_Capita dataset
GDP_per_Cap <- GDP_per_Cap |>
  select(country, `1950`:`2020`)

child_mortality_long <- child_mortality |>
  pivot_longer(
    cols = !country,
    names_to = "Year",
    values_to = "Mortality Rate"
  )

GDP_per_Cap_long <- GDP_per_Cap |>
  pivot_longer(
    cols = !country,
    names_to = "Year",
    values_to = "GDP Per Capita"
  )

# function converts numbers in "XXk" format to numeric
# eg "89k" -> 89,000
convert_k_to_number <- function(x) {
  if (grepl("k", x, fixed = TRUE)) {
    numeric_value <- as.numeric(sub("k", "", x)) * 1000
  } else {
    numeric_value <- as.numeric(x)
  }
  return(numeric_value)
}

# converts mortality rate to numeric
child_mortality_long <- child_mortality_long |>
  filter(!is.na(child_mortality_long$`Mortality Rate`)) |>
  mutate(`Mortality Rate` = as.numeric(`Mortality Rate`))

# converts gdp to numeric
GDP_per_Cap_long <- GDP_per_Cap_long |>
  mutate(`GDP Per Capita` = sapply(`GDP Per Capita`, convert_k_to_number))

joined_df <- inner_join(child_mortality_long, GDP_per_Cap_long, join_by(country, Year)) |>
  mutate(Year = as.numeric(Year))

We pivoted the data to be longer so it can be easily analyzed (on mortality rate and GDP per capita for each year). Then, we converted the values to be numeric as they represent quantitative data. For the GDP, we had to convert some values that had a “k” in it to the actual multiple of 1000. The final dataset has 195 different countries, spanning across 71 years.

1.1 Variable Description

The first variable in our joined data set is “country”, which specifies the country that the following data describes. The next variable is “Year”, which specifies the year in which the following data was collected. In our data, we limited our “Year” variable from 1950-present because many countries did not exist and therefore had missing data prior to then. The “Mortality Rate” variable represents the number of deaths of children under 5 years old per 1000 live births. Finally, the “GDP Per Capita” variable represents the gross domestic product per person in USD. The data for “GDP Per Capita” was also adjusted for differences in purchasing power.

According to the International Monetary Fund, GDP is a monetary measure of the market value of all the final goods and services produced in a specific time period by a country or countries. GDP is often used by the government of a single country to measure its economic health.

Our hypothesis predicts a negative correlation between GDP per capita and child mortality rates, based on the assumption that higher economic status, as indicated by GDP per capita, is associated with enhanced access to healthcare, improved living conditions, and higher investment in public health infrastructure. These conditions are expected to contribute to lower child mortality rates, reflecting better health outcomes in economically prosperous countries.

We also expect a significant decrease in child mortality rates as time progresses, echoing global health advancements and the intensification of international efforts to combat child mortality. This expectation is supported by historical data, including reports from the World Health Organization, which indicate a decline of over 59% in child mortality rates since 1990. Such trends underscore the impact of both economic development and targeted health interventions in improving child survival rates.

2 Linear Regression

In our study, we employ linear regression to examine the potential impact of GDP per capita on child mortality rates. Linear regression is an appropriate statistical method for this analysis as it allows us to model the relationship between a continuous response variable and one or more explanatory variables. We hypothesize that GDP per capita (explanatory variable, \(x\)) inversely affects the child mortality rate (response variable, \(y\)), with the expectation that higher GDP per capita is associated with lower child mortality rates.

As a preliminary check, we created two scatterplots to visually assess the relationship between GDP per capita and the Child Mortality Rate over time. Our aim was to discern the form, direction, strength, and presence of any unusual observations in the relationship between these two variables.

Code
# | code-fold: true
year_avgs <- joined_df |>
  group_by(country) |>
  summarise(
    Avg_Mortality_Rate = mean(`Mortality Rate`, na.rm = TRUE),
    Avg_GDP_Per_Capita = mean(`GDP Per Capita`, na.rm = TRUE)
  )

2.1 Visualizations

Code
# | code-fold: true
year_avgs |>
  ggplot(aes(x = Avg_GDP_Per_Capita, y = Avg_Mortality_Rate)) +
  geom_point(alpha = 0.5, color = 'blue') +
  theme_minimal() +
  labs(x = "Average GDP Per Capita (in USD)", y = "Average Mortality Rate",
       title = "Average Child Mortality Rate vs Average GDP Per Capita by Country")

The scatterplot above of average GDP per capita against average mortality rate shows a general curved downward trend, suggesting a negative correlation where higher GDP per capita is associated with lower child mortality rates. The relationship does exhibit a non-linear form, with the strength of the relationship varying across the range of GDP values. Notably, some countries with similar GDP per capita exhibit a wide range of mortality rates, indicating other factors may influence mortality beyond GDP alone.

Code
# | code-fold: true
joined_df |>
  mutate(Decade = floor(Year / 10) * 10) |>
  ggplot(aes(x = `GDP Per Capita`, y = `Mortality Rate`)) +
  geom_point(alpha = 0.25, color = 'purple') +
  facet_wrap(~ Decade, scales = "free") +
  labs(title = "GDP Per Capita vs Child Mortality Rate Over Time",
       x = "GDP Per Capita (in USD)",
       y = "Child Mortality Rate") +
  theme_minimal()

Code
# | code-fold: true
joined_df |>
  mutate(Decade = floor(Year / 10) * 10) |>
  ggplot(aes(x = `GDP Per Capita`, y = `Mortality Rate`)) +
  geom_point(alpha = 0.25, color = 'purple') +
 labs(title = 'Year: {frame_time}', x = 'GDP Per Capita (in USD)', y = 'Child Mortality Rate') +
  transition_time(as.integer(Year)) +
  ease_aes('linear')

The second graph plots the GDP per capita per against the child mortality rate and is faceted by decade in order to see how the relationship between the two variables change over time. At a glance, it does not appear that the relationship between the two variables change drastically over the decades as all of the grids have a are similar in form and direction, all having a downward concave shape. However, we do see that the density of points on the left side of the plots decreases over time especially from 1950-1970 which could possibly indicate the relationship between the variables strengthening. While the overall shape of the relationship between the variables is similar across decades, it’s evident that the actual numerical values do vary over time, as indicated by the differing scales on each graph.

In evaluating the appropriateness of the model, we consider the LINE assumptions: Linearity: The relationship between GDP per capita and child mortality rate appears linear, Independence: Observations are independent of each other, Normality: The residuals of the model are normally distributed, Equal Variance: The variance of the error terms is constant across values of GDP per capita.

Based on our scatter plots so far, the model does not appear to be very linear, and instead exhibits the shape of either a logarithmic function or of exponential decay. We can however confirm that the data are independent: it’s unlikely that American child mortality has an effect on Angolan or Brazilian child mortality. Based on trade relationships though, it is possible that American, British, and German GDPs may have relationships with each other.

2.2 Regression and Model Fit

In addition to the visualizations, we also fitted the data to a linear regression model to obtain the regression equation.

Code
# | code-fold: true
mortalitygdp_lm <- lm(Avg_Mortality_Rate ~ Avg_GDP_Per_Capita,
  data = year_avgs)

mortality_regression <- tidy(mortalitygdp_lm)

kable(mortality_regression, caption = "GDP Per Capita Regression Model")
GDP Per Capita Regression Model
term estimate std.error statistic p.value
(Intercept) 128.4115700 5.3256071 24.112100 0
Avg_GDP_Per_Capita -0.0025385 0.0002628 -9.658573 0

Based on the results of our linear regression, we use the equation \[{y} = 128 - 0.00254x + 5.3\] to represent the relationship between GDP and child mortality. In this model, \({y}\) represents the predicted child mortality rate, \(x\) represents represents the GDP per capita, \(b0\) (intercept) indicates the child mortality rate when GDP per capita is zero, \(b1\) (slope) represents the change in child mortality rate for a one-unit increase in GDP per capita, and ε represents the error.

We find that the average child mortality rate is 128.4 when GDP is zero. We also find that, holding other factors constant, a one-unit ($1 USD) increase in GDP corresponds to the child mortality rate decreasing -0.0025 on average.

In addition to visually assessing the appropriateness of a linear regression model, we will also evaluate it by looking at the amount of variability in the response, fitted, and residual values. To do this, we generate the variance for each and based on those numbers, evaluate the quality of the model.

Code
# | code-fold: true
aug <- broom::augment(mortalitygdp_lm)
variance_table <- data.frame(
  Variable = c("Response Values", "Fitted Values", "Residuals"),
  Variance = c(var(aug$Avg_Mortality_Rate), 
               var(aug$.fitted), 
               var(aug$.resid))
)
kable(variance_table, caption = "Variance Table") 
Variance Table
Variable Variance
Response Values 4715.485
Fitted Values 1536.559
Residuals 3178.926

The variance table derived from our model allows us to calculate the R-squared using the formula \(R^2 = \frac{{\text{var}( \hat{y})}}{{\text{var}(y)}}\), with \(\hat{y}\) being the variance of the fitted values and \(y\) being the variance of the response values. Plugging these values into our formula, we get an \(R^2\) value of approximately 0.3258. This indicates that approximately 32.58% of the variance in child mortality rates is explained by GDP per capita. This suggests that the model captures a moderate amount of the variation in the dependent variable. It is then probable that there is a presence of other factors affecting child mortality that is not captured by GDP per capita alone as there is still a significant amount of unexplained variability. The measurement of child mortality rates was based on the probability of dying between birth and age 5 per 1,000 live births, providing a standardized metric for international health comparisons. Overall, these results suggest that the quality of the model is average, because it explains some variation in the dependent variable, but does not capture a significant amount of it.

3 Simulation

3.1 Visualizing Simulations from the Model

In this portion of the analysis, we will assess the robustness of our linear regression model by generating simulated data from the linear model, and comparing it with our observed data. This simulation allows us to envision how the model might perform under different scenarios that align with the observed variance. If the observed and simulated data closely resemble each other, it reinforces the model’s adequacy in capturing the essence of the relationship between GDP and mortality rates.

However, after obtaining and graphing the simulated data, the side-by-side plots reveal the observed data’s tighter clustering at lower GDP values, diverging at higher values, whereas the simulated data display a more uniform dispersion. The observed data also has a downward curved form, whereas the simulated data has a more downward linear form. This discrepancy suggests the presence of non-linearities or varying error terms across the GDP spectrum that the linear model does not account for, indicating that alternative models may provide a more nuanced understanding of the data, and that our current linear regression model may not be the best fit.

Code
# | code-fold: true
# 1. generate predictions using  predict() 
predictions <- predict(mortalitygdp_lm, newdata = year_avgs)

# 2. add random errors to the predictions w/ residual standard error (acquired with sigma())
std_error <- sigma(mortalitygdp_lm)
simulated_errors <- rnorm(n = length(predictions), mean = 0, sd = std_error)

# 3. plotting
simulated_year_avgs <- year_avgs|>
    mutate(Simulated_Mortality = predictions + simulated_errors)

# obs
observed_plot <- ggplot(year_avgs, aes(x = Avg_GDP_Per_Capita, y = Avg_Mortality_Rate)) +
  geom_point(aes(color = "Observed"), alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue", linetype = "dashed") +
  theme_minimal() +
  labs(title = "Observed",subtitle = "Average Child Mortality Rate", x= "Average GDP Per Capita ($)", y = "",
       color = "Data Type")


# sim
simulated_plot <- ggplot(simulated_year_avgs, aes(x = Avg_GDP_Per_Capita, y = Simulated_Mortality)) +
  geom_point(aes(color = "Simulated"), alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "red", linetype = "dashed") +
  theme_minimal() +
  labs(title = "Simulated", subtitle = "Average Child Mortality Rate", x= "Average GDP Per Capita ($)", y = "",
       color = "Data Type")


library(patchwork)

(observed_plot | simulated_plot) + 
  patchwork::plot_layout(guides = 'collect') & 
  theme(legend.position = 'bottom')

3.2 Generating Multiple Predictive Checks

To further assess our model, we generated 1000 simulated datasets using the model and regressed it against the observed dataset, retaining the \(R^2\) value from each.

Code
# | code-fold: true

sd <- sigma(mortalitygdp_lm)
noise <- function(x, mean = 0, sd){
  x + rnorm(length(x), 
            mean, 
            sd)
}

nsims <- 1000
sims <- map_dfc(.x = 1:nsims,
                .f = ~ tibble(sim = noise(predictions, 
                                          sd = sd)
                              ))
colnames(sims) <- colnames(sims) |> 
  str_replace(pattern = "\\.\\.\\.",
                  replace = "_")

sims <- year_avgs |> 
  filter(!is.na(`Avg_GDP_Per_Capita`)) |> 
  select(`Avg_Mortality_Rate`)|> 
  bind_cols(sims)



sim_r_sq <- sims |> 
  map(~ lm(Avg_Mortality_Rate ~ .x, data = sims)) |> 
  map(glance) |> 
  map_dbl(~ .x$r.squared)

sim_r_sq <- sim_r_sq[names(sim_r_sq) != "Avg_Mortality_Rate"]


tibble(sims = sim_r_sq) |> 
  ggplot(aes(x = sims)) + 
  geom_histogram(binwidth = 0.025) +
  labs(x = expression(R^2 ~ "values"),
       y = "",
       subtitle = "Number of Simulated Models") +
  theme_bw()

The distribution depicted in the graph illustrates the spread of \(R^2\) values derived from numerous regression analyses conducted on simulated datasets. Characterized by a roughly normal shape slightly skewed to the left, the distribution suggests that a significant portion of the regression models demonstrate moderate explanatory power. The distribution being centered around 0.1 indicates that, on average, the models explain approximately 10% of the variability in average child mortality rate. This implies that while the model captures a portion of the variation, there’s still substantial unexplained variability, suggesting that factors beyond GDP per capita contribute significantly to variations in child mortality rates. Therefore, while the model provides some insight into the relationship between GDP per capita and Child Mortality Rate, the predictive power of the models, on average, is relatively low, so they may have limited utility in accurately predicting child mortality rates based solely on GDP per capita.